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Portable  Implementation  of  a  Generic 
Exponential  Function  in  Ada 

by 

Ping  Tak  Peter  Tang 


Abstract 

By  presenting  a  provably  accurate  implementation  of  the  exponential  function,  we  illus¬ 
trate  that  an  accurate  and  portable  elementary-function  library  can  be  implemented  in 
Ada. 

1  Introduction 

Since  July  of  1986,  the  SIGAda  Numerics  Working  Group  has  been  working  on  a  speci¬ 
fication  for  the  elementary  transcendental  functions  for  Ada.  The  specification  includes 
requirements  on  accuracies  for  the  various  functions.  Our  interest  lies  not  only  in  for¬ 
mulating  the  specification,  but  also  in  demonstrating  portable  implementations  that  are 
reasonably  accurate  and  efficient.  One  of  the  goals  is  to  illustrate  that  the  proposed 
specification  is  well  formulated,  by  implementing  an  elementary-function  library  that  is 
both  portable  and  provably  accurate  to  well  within  the  requirements  of  the  specification. 

This  is  a  report  on  the  exponential  function  that  we  implemented  in  conformance 
with  the  specification  dated  December  6,  1987.  (Although  the  specification  is  still  evolv¬ 
ing,  we  expect  further  changes  to  be  minor.)  The  next  two  sections  discuss  two  different 
problems  related  to  the  construction  of  portable  generic  libraries.  Solutions  to  the  prob¬ 
lems  are  also  presented.  These  solutions  will  be  applicable  not  only  to  the  exponential 
function  but  also  to  the  other  functions.  Sections  4  and  5  present  the  algorithm  and  the 
implementation  details  for  the  exponential  function.  Section  6  analyzes  the  implemen¬ 
tation  and  provides  an  error  bound  for  the  computed  result.  Section  7  presents  some 
results  obtained  from  a  number  of  tests  performed  on  the  function.  Finally,  Section  8 
discusses  further  work  to  be  pursued.  The  Appendix  lists  the  complete  source  code. 
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2  Generic  Packages  and  Range  Constraints 

2.1  Problem 

The  proposed  specification  requires  that  the  library  of  elementary  functions  be  generic 
and  provide  an  accuracy  comparable  to  the  base  type  of  the  generic  actual  type.  What 
type,  then,  should  be  used  inside  the  generic  package  for  computations?  The  generic 
actual  type  is  unsuitable  because  it  may  have  a  range  constraint  that  can  be  violated 
during  intermediate  calculations  in  the  package.  This  violation  would  then  cause  a 
constraint  error  to  be  raised  even  if  the  final  result  would  have  been  perfectly  acceptable, 
had  the  exception  not  occurred. 

It  is  clear  that  a  constraint-free  type  with  a  precision  comparable  to  that  of  the 
base  type  of  the  generic  actual  type  must  be  used  within  the  generic  package  body.  Let 
W0RKING_FL0AT  be  such  an  ideal  type,  and  let  FLOAT-TYPE  be  the  generic  formal  type. 
How  can  WORKING-FLOAT  be  declared?  One  may  try  declaring 

type  WORKING-FLOAT  is  digits  FLOAT-TYPE’ BASE’ DIGITS. 

Unfortunately,  this  does  not  work  because  FLOAT-TYPE ’BASE ’DIGITS  is  a  nonstatic  ex¬ 
pression.  A  workable  solution,  but  impractical  for  portable  implementations,  would  be 
to  perform  a  case  analysis  as  follows: 

case  FLOAT_TYPE’ BASE ’DIGITS  is 
when  1  => 
declare 

type  WORKING-FLOAT  is  digits  i; 
begin 

—  complete  calculation  of  exp  in  this  case 

—  approximation  accurate  to  at  least  1  digit 
end; 

when  2  => 
declare 

type  WORKING-FLOAT  is  digits  2; 
begin 

—  complete  calculation  of  exp  in  this  case 

—  approximation  accurate  to  at  least  2  digits 
end; 


when  SYSTEM. MAX-DIGITS  => 
declare 

type  WORKING-FLOAT  is  digits  SYSTEM. MAX-DIGITS; 
begin 

—  complete  calculation  of  exp  in  this  case 
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—  approximation  accurate  to  at  least 
—  SYSTEM . MAX _DI GITS  digits 
end; 

end  case; 

For  any  portable  implementation  that  intends  to  accommodate  systems  whose  at¬ 
tribute  SYSTEM. MAX-DIGITS  is  15  or  larger,  the  solution  just  proposed  leads  to  a  huge 
code. 


2.2  Solution 

As  a  practical  solution,  we  condense  the  many  cases  as  follows: 

case  FLOAT -TYPE  * BASE  *  DIGITS  is 
when  1 . . 6  => 
declare 

type  WORKING-FLOAT  is  digits 

(6+SYSTEM. MAX-DIGITS  -  abs (6-SYSTEM. MAX-DIGITS) )/2; 

—  the  expression  above  is  MIN(  6,  SYSTEM. MAX-DIGITS  ) 
begin 

—  complete  calculation  of  exp  in  this  case 

—  approximation  accurate  to  at  least  6  digits 
end; 

when  7 . . IS  *> 
declare 

type  WORKING-FLOAT  is  digits 

(15+SYSTEM. MAX-DIGITS  -  abs ( 15-SYSTEM . MAX-DIGITS) ) /2 ; 
begin 

—  complete  calculation  of  exp  in  this  case 

—  approximation  accurate  to  at  least  15  digits 
end; 


when  27 . . 33  => 
declare 

type  WORKING-FLOAT  is  digits 

(33+SYSTEM. MAX-DIGITS  -  abs (33-SYSTEM. MAX-DIGITS) )/2; 
begin 

—  complete  calculation  of  exp  in  this  case 

—  approximation  accurate  to  at  least  33  digits 
end; 

when  others  => 

—  cannot  handle  this  case 
end  case ; 
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This  method  guarantees  that 


WORKING-FLOAT ’DIGITS  >  FLOAT-TYPE 1  BASE ’ DIGITS  (1) 

always.  Furthermore,  to  avoid  using  a  type  that  is  unnecessarily  more  accurate,  we  note 
that  equality  for  (1)  holds  whenever  the  right  boundary  of  the  case  coincides  with  a 
predefined  type  of  the  machine  on  which  the  code  runs.  Thus,  we  have  chosen  the  cases 
such  that  on  the  all  the  Ada  systems  that  we  have  experience  with, 

WORKING-FLOAT’ DIGITS  =  FLOAT-TYPE ’BASE ’DIGITS  (2) 

for  all  possible  FLOAT-TYPEs. 

3  Accurate  Implementation  with  Extra-Precise  Arithmetic 

A  typical  implementation  of  an  elementary  function  involves  three  steps:  argument  re¬ 
duction  to  a  primary  interval,  approximation  of  the  function  on  that  primary  interval, 
and  reconstruction  of  the  function  value.  In  accurate  implementations,  it  is  standard 
practice  to  perform  argument  reduction  and  reconstruction  in  an  extra-precise  data  type. 
When  such  a  data  type  is  unavailable,  extra  precision  is  simulated  in  software  using  the 
working-precise  data  type  (cf.  [3],  [5]).  Therefore,  an  elementary  function  package  fol¬ 
lowing  that  practice  would  try  to  exploit  extra-precise  data  types  whenever  they  are 
available,  and  resort  to  simulation  when  they  are  not. 

In  principle,  a  portable  generic  package  is  able  to  detect  at  elaboration  time  whether 
an  extra-precise  data  type  is  available,  and  consequently  can  ensure  that  appropriate 
actions  be  taken  when  the  function  is  invoked  later.  In  practice,  however,  because  of 
technicalities  in  constructing  a  portable  generic  package,  such  an  approach  will  lead  to 
a  huge  code  jammed  with  several  sets  of  (otherwise  unnecessary)  constants  used  for 
argument  reduction  and  reconstruction  and  with  many  complicated  branches. 

Fortunately,  there  are  two  practical  alternatives  to  the  impractical  approach,  one 
slightly  more  efficient  while  the  other  noticeably  more  accurate.  We  have  chosen  the 
latter.  In  what  follows,  we  describe  and  compare  the  two  methods,  examine  the  cost  of 
the  method  that  we  adopted,  and  explain  how  to  switch  from  the  adopted  method  to 
the  other. 

3.1  Two  Methods 

In  Section  2.2,  we  described  how  we  could  declare  the  type  WORKING-FLOAT  that  cor¬ 
responds  to  the  same  predefined  floating  type  of  the  generic  actual  type.  Thus,  an 
acceptably  accurate  implementation  can  compute  solely  in  WORKING-FLOAT  and  perform 
simulation  of  extra  precision  at  critical  places.  There  is  an  advantage  as  well  as  disad¬ 
vantages  to  this  method. 
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•  Advantage: 

The  cost  is  minimal.  The  implementation  is  no  more  expensive  than  an  acceptably 
accurate  implementation  need  be.  The  price  of  using  simulation  has  to  be  paid 
even  when  the  package  is  not  required  to  be  generic  or  portable. 

•  Disadvantages: 

1.  This  method  leads  to  a  large  body  of  code.  The  reason  is  that  the  three  steps 
in  the  implementation  —  argument  reduction,  approximation,  and  reconstruc¬ 
tion  —  must  be  included  in  each  of  the  different  cases  of  WORKING-FLOATs  (cf. 
Section  2.2). 

2.  When  extra  precision  is  available  without  simulation,  this  method  is  less  ac¬ 
curate  than  that  of  employing  the  readily  available  higher-precise  arithmetic. 
The  reason  is  that,  because  of  the  high  cost,  extra- precision  simulation  is 
done  only  in  a  few  of  the  many  places  where  higher  precision  would  enhance 
accuracy  noticeably. 

To  overcome  the  disadvantages,  we  have  taken  an  approach  that  uses  unsimulated  ex¬ 
tra  precision  whenever  it  is  available.  Because  of  portability  and  genericity,  the  only  con¬ 
venient  extra-precise  type  is  L0NGEST.FL0AT,  the  type  with  the  maximum  allowable  num¬ 
ber  of  digits.  Moreover,  because  we  cannot  determine  a  priori  whether  WORKING-FLOAT 
leaves  us  with  any  extra-precise  type,  the  implementation  must  simulate  extra-precision 
operations  as  well.  Let  us  consider  the  advantages  and  disadvantages  of  this  approach. 

•  Advantages: 

1.  The  resulting  code  is  compact.  The  reason  is  that  the  code  for  both  argument 
reduction  and  reconstruction  need  appear  only  once.  These  two  steps  compute 
solely  in  L0NGEST.FL0AT  and  work  for  all  possible  generic  actual  types. 

2.  The  accuracy  is  enhanced  in  general.  When 

LONGEST-FLOAT  *  DIGITS  >  WORKING-FLOAT’ DIGITS, 

the  result  obtained  would  be  more  accurate  than  that  obtained  from  compu¬ 
tations  done  solely  in  WORKING-FLOAT,  even  with  the  help  of  extra-precision 
simulation. 

•  Disadvantages: 

1.  Work  is  duplicated  in  this  approach.  When 

LONGEST-FLOAT ’DIGITS  >  WORKING-FLOAT ’DIGITS, 

the  simulation  of  extra  precision  is  rendered  redundant  by  the  use  of  the  type 
LONGEST-FLOAT. 
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2.  The  approach  may  be  unaffordably  inefficient.  It  is  conceivable  that  in  some 
systems,  operations  in  LONGEST-FLOAT  are  extremely  inefficient.  For  example, 
it  is  possible  that  operations  on  the  H-format  (113  significant  bits)  data  types 
supported  by  the  VMS  operating  systems  may  be  implemented  in  software  on 
some  particular  machines. 

We  have  chosen  the  second  approach  because  of  the  higher  accuracy  it  offers  and  the 
compact  code  that  results  from  it.  Section  3.2  below  shows  that  the  disadvantages 
discussed  above  are  insignificant  in  most  cases;  and  Section  3.3  discusses  how  the  first 
approach  can  be  implemented  by  slight  modification  of  our  code  (given  in  the  Appendix). 

3*2  Cost 

How  often  is  work  being  duplicated  in  our  approach?  On  systems  with  only  two  pre¬ 
defined  floating  types,  duplication  occurs  only  half  of  the  time.  In  those  undesirable 
cases,  the  cost  of  the  unnecessary  effort  is  only  approximately  five  multiplications  in 
L0NGEST_FL0AT.  Moreover,  operations  in  L0NGEST.FL0AT  are  efficient  whenever  they  are 
implemented  in  hardware.  Consequently,  on  machines  with  only  two  predefined  floating 
types,  both  supported  in  hardware,  our  implementation  is  justified.  This  applies  to  all 
but  two  Ada  systems  that  we  know  of. 

3.3  Alternative 

On  systems  such  as  the  IBM/370  or  VAX  under  VMS,  there  are  usually  three  predefined 
floating  types.  If  calculations  in  the  widest  format  are  excessively  expensive,  imple¬ 
mentors  can  easily  incorporate  the  argument  reduction  and  the  reconstruction  into  the 
approximation  step  of  the  code  (cf.  the  Appendix).  By  doing  so,  all  calculations  will  be 
performed  in  the  base  type  of  the  generic  actual  type. 

4  Algorithm 

The  algorithm  follows.  Implementation  details  are  given  in  the  next  section. 

Step  1.  Filter  out  the  exceptional  cases.  When  the  magnitude  of  the  input  argument 
X  is  so  large  that  an  accurate  result  cannot  be  represented  in  the  underlying  data 
type,  the  exception  ARGUMENT_ERROR  should  be  raised.  There  are  other  situations 
in  which  that  exception  should  be  raised,  and  they  are  stated  precisely  in  the 
specification. 

Step  2.  Reduce  the  input  argument  X  to  [— log  2/64, log  2/64].  Obtain  integers  n,  m, 
and  j  and  machine  numbers  Rx  and  R2  such  that  (up  to  roundoff) 

X  =  n  log  2/32  +  (I?!  +  R2 ), 
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|i?i  +  -R2I  <  log  2/64.  Furthermore, 

n  =  32  'm  +  j,  j  =  0,1,..., 31. 

Because  of  rounding  errors,  \Ri+R2\  may  exceed  log  2/64  by  a  few  units  of  its  last 
place,  and  the  calculated  value  of  n  may  also  differ  by  1  from  the  integer  closest  to 
32X /  log  2.  The  analysis  later  on  will  show  that  implementation  is  still  accurate 
despite  these  rounding  errors. 

Step  3.  Approximate  exp(i?i  +  R2)  -  1  by  a  polynomial  p(Ri  +  R2),  where 

p(t)  —  t  +  flit2  +  fl2^  H - f  Ont”"*"*  • 

Step  4.  Reconstruct  exp(X)  by 

exp(X)  =  2m(2i/32  +  2j/32p(Ri  +  R2)). 

5  Implementation  Details 

5.1  Assumptions  about  Floating-Point  Arithmetic 

Ada  demands  that  certain  behavior  of  the  underlying  floating-point  arithmetic  be  sat¬ 
isfied  when  the  operands  involved  are  safe  numbers.  For  example,  for  safe  numbers  A 
and  B,  A  -  B,  in  the  absence  of  underflow,  has  to  be  exact  whenever  cancellation  occurs. 
The  reason  is  that,  because  of  cancellation,  the  difference  is  a  safe  number.  The  same 
inference,  however,  cannot  be  made  if  the  numbers  A  and  B  are  merely  machine  numbers. 

In  practice,  implementations  must  be  able  to  handle  machine-number  input  and, 
ideally,  provide  an  accuracy  with  respect  to  the  machine  precision  that  is  in  general 
higher  than  that  of  the  safe  numbers.  Consequently,  our  implementation  needs  to  make 
certain  assumptions  about  the  floating-point  data  types  of  the  target  machines.  These 
assumptions  are  related  to  the  exponent  width,  the  arithmetic,  and  the  radix. 

The  assumptions  are  as  follows: 

•  Exponent  Width:  We  assume  that  the  number  of  bits  in  the  exponent  field  never 
exceeds  L/ 3,  where  L  is  the  actual  number  of  binary  bits  in  the  mantissa  of  the 
machine.  Otherwise,  the  accuracy  of  the  final  result  would  degrade  as  the  mag¬ 
nitude  of  the  input  argument  becomes  large.  This  assumption  is  built  into  the 
number  of  bits  of  the  value  log  2/32  we  have  stored  in  the  program. 

•  Arithmetic:  Let  A  and  B  be  two  machine  numbers  such  that 

2B  >  A  >  B. 


Thus,  cancellation  occurs  in  A  —  B. 


On  binary  machines,  we  assume  that  the  subtraction  is  exact  whenever  B  has  one 
(or  more)  trailing  zero  bit(s). 

On  nonbinary  machines,  we  assume  that  A  -  B  is  exact.  This  assumption  requires 
in  particular  that  a  guard  digit  be  present  in  the  subtraction  hardware. 

If  this  assumption  on  arithmetic  is  violated,  our  scheme  for  argument  reduction 
may  fail  to  be  accurate. 

•  Radix:  We  assume  that  the  radix  is  either  2  (binary)  or  16  (hexadecimal).  We  made 
this  assumption  because  on  most  binary  and  hexadecimal  floating-point  arithmetic 
that  we  know  of,  the  previous  assumption  about  arithmetic  is  satisfied. 

All  the  Ada  systems  with  which  we  have  experience  satisfy  our  assumptions.  The 
following  tabulates  those  machines1  (cf.  [2]). 


System 

Data 

Type 

Radix 

’digits 

Mantissa 
Length  (in  bits) 

Exponent 
Width  (in  bits) 

IBM/ 

370 

single 

16 

6 

24 

7 

double 

16 

15 

48 

7 

quad 

16 

20 

96 

7 

CRAY-1 

single 

2 

13 

48 

14 

double 

2 

27 

98 

14 

VAX/ 

VMS 

single 

2 

6 

24 

8 

d-format 

2 

9 

56 

8 

g-format 

2 

15 

53 

11 

h-format 

2 

33 

113 

15 

IEEE/ 

754 

single 

2 

6 

24 

8 

double 

2 

15 

53 

11 

5.2  Implementation 

The  following  notes  correspond  to  the  algorithm  in  the  previous  section.  All  computa¬ 
tions  are  carried  out  in  the  order  prescribed  by  the  parentheses.  In  the  following  discus¬ 
sions,  X  is  the  input  argument,  FLOAT-TYPE  is  the  generic  formal  type,  and  LONGEST-FLOAT 
is  the  type  declared  as  digits  SYSTEM. MAX-DIGITS. 

Step  1.  The  exceptional  cases  are  as  follows: 

•  Raise  ARGUMENT-ERROR  if  |X|  >  LARGE-THRESHOLD,  where 

LARGE-THRESHOLD  :=  2  *  FLOAT-TYPE  ’  SAFE-EMAX  *  log  2. 

1T^e  ’digits  attributes  for  IBM/370  quad  precision  and  VAX  d-formats  are  shorter  than  the  mantissa 
can  offer  because  the  exponent  ranges  of  those  data  types  are  limited. 
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Note  that  for  |X|  >  LARGE-THRESHOLD, 

ex  >  2  •  FLOAT-TYPE’ SAFE-LARGE, 
or 

ex  <-•  FLOAT-TYPE’ SAFE-SMALL. 

“  2 

Thus,  some  arguments  that  are  not  filtered  out  here  may  still  warrant  an 
exception.  The  computations  in  the  next  three  steps  will  be  able  to  handle 
those  situations  (cf.  exceptional  handling  in  the  code). 

•  Return  1.0  +  X  if  |X|  <  SMALL-THRESHOLD,  where 

SMALL-THRESHOLD  :=  FLOAT-TYPE ’BASE ’EPSILON. 

Step  2.  Let  Y  :=  LONGEST-FLOAT(X)  be  the  input  argument  converted  to  L0NGEST-FL0AT. 
To  perform  the  argument  reduction  accurately,  do  the  following. 

•  First,  calculate  N,  Nj ,  and  N2  as  follows: 

N  :=  LONGEST-INTEGER(Y  *  INV-L); 

—  see  explanation  below  on  LONGEST-INTEGER 
if  |N|  >  28  then 
N2  :=  N  mod  26; 

— N2  =  0, 1, . . .  ,26  —  1 
Ni  :=  N  —  N2; 
otherwise, 

N2  :=  0; 

Ni  :=  N. 

INV-L  is  the  value  32/ log  2  represented  in  L0NGEST-FL0AT.  The  declaration 

type  LONGEST-INTEGER  is  range 

SYSTEM . MIN-INT . . SYSTEM . MAX.INT ; 

is  meant  to  provide  an  integer  type  that  would  accommodate  all  the  possible 
round-to-integer  values  of  Y  *  INV-L.  A  better  way,  however,  is  to  use  the 
proposed  primitive  function  ([6])  REAL-INT  that  returns  in  floating  type  the 
integral  value  closest  to  a  given  real  number. 

•  The  reduced  argument  is  represented  in  two  L0NGEST-FL0AT  variables  Ri  and 
R2.  The  idea  is  to  represent  log  2/32  to  extra  precision  by  two  L0NGEST-FL0AT 
numbers  Li  and  L2.  The  values  of  Li  and  L2  are  chosen  at  run  time  in  such  a 
way  that  Li  +  L2  represents  log  2/32  to  a  sufficient  number  of  bits  more  than 
the  mantissa  length  of  L0NGEST-FL0AT.  After  Li  and  L2  are  chosen,  compute 
Ri  and  R2  as  follows: 

TMP  :=  Ni  *Li; 
if  |Y|  >  |TMP|  then 
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Ri  :=  Y  —  TMP; 
otherwise, 

Rl  :=  (Y  -  TMP/2)  -  TMP/2; 

If  N2  ^  0,  Ri  :=  Ri  -  N2  *  Li; 

R2  :=  — N  *  L2. 

•  Finally,  calculate  M  and  J  by 

J  :=  N  mod  32;  M  :=  (N  -  J)/32. 

Step  3.  Let  WORKING-FLOAT  be  the  type  described  in  Section  2.  The  approximation 
polynomial  is  computed  by  a  standard  recurrence  in  WORKING-FLOAT  as  follows: 

R  :=  W0RKING_FL0AT(Ri  +  R2); 

POLY  :=  R*R*(Ai  +  R*(A2+R*(...  +  R*An)...)); 

Q  :=  Ri  +  (R2  +  LONGEST-FLOAT(POLY)). 

The  coefficients  are  obtained  from  a  Remez  algorithm. 

Step  4.  Each  of  the  values  2J-/32,  j  =  0,1,..., 31,  is  calculated  beforehand  and  repre¬ 
sented  by  two  LONGEST-FLOAT  numbers 

TW0_T0_J_BY_32(J,  LEAD)  and  TW0_T0-J_BY_32(J,  TRAIL). 

The  leading  part  has  thirteen  significant  bits,  and  the  trailing  part  has  full  preci¬ 
sion.  Thus  the  sum  of  the  two  represents  2J/32  to  roughly  thirteen  more  bits  than 
LONGEST-FLOAT  has  in  its  mantissa.  The  reconstruction  is  as  follows: 


F  :=  TW0-T0-J  _BY_32(J,  LEAD)  +  TW0_T0_J_BY-32(J,  TRAIL) 
Ql  :=  TW0_T0_J_BY_32(  J,  LEAD) 

Q2  :=  TW0_T0_J_BY_32(J,  TRAIL)  +  F*Q 
EXP  :=  2*1  *  (Qi  +  Q2)  for  binary  machines 
EXP  :=  2M  *  Qi  +  2M  *  Q2  for  hexadecimal  machines 


6  Error  Analysis  with  Ada’s  Model 


The  proposed  specification  for  the  exponential  function  EXP  requires  that 


EXP(X)  -  ex 
e* 


<  4  *  FLOAT-TYPE* BASE’ EPSILON 


for  most  input  values  X.  The  analysis  to  follow  will  show  that  with  moderate  assump¬ 
tions  about  the  underlying  floating-point  arithmetic,  the  accuracy  requirement  is  easily 
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achieved  by  our  algorithm  implemented  as  described.  Since  most  systems  perform  arith¬ 
metic  more  accurately  than  prescribed  by  Ada’s  model,  our  analysis  will  inevitably  be 
pessimistic.  In  Section  6.6,  we  will  discuss  accuracy  in  terms  of  machine  precision. 

Our  goal  in  the  analysis  is  to  estimate,  in  terms  of 

FLOAT-TYPE ’ BASE ’ EPSILON, 

the  final  error  in  the  implemented  function.  It  is  obvious  that  our  implementation  is 
least  accurate  when 

FLOAT-TYPE’ BASE’ EPSILON  =  LONGEST-FLOAT’ EPSILON. 

Thus,  we  will  analyze  this  case  only.  Throughout  the  analysis,  we  will  use  e  and  radix  to 
denote  ’EPSILON  and  ’MACHINE-RADIX  of  LONGEST-FLOAT,  respectively,  and  “exponent 
width”  to  denote  the  number  of  binary  bits  in  the  exponent  field  of  its  predefined  floating 
type.  We  find  the  following  notation  useful  in  error  analysis: 

•  Typefaced  letters,  X,  Y,  P,  Q,  etc.,  denote  real  numbers  that  are  representable 
exactly  in  the  machine. 

•  Angle  brackets  (•  •  •)  denote  the  value  of  a  real  number  “•  •  •”  rounded  to  machine 
precision.  Thus,  executing  the  statement 


A  :=  B  *  C 


in  machine  precision  gives  the  value 


A  =  (B-C). 

•  €  denotes  the  value  FLOAT-TYPE ’BASE ’EPSILON. 

•  Let  x  be  a  real  number.  We  define  £(x)  to  be  the  difference  between  the  value  of 
x  when  rounded  to  working  precision  and  x  itself,  thus: 

f(x)  :=  (x)  -  x. 

•  If  one  uses  (•)  and  £(•),  the  relationship 

(A  op  B)  =  A  op  B  +  f(A  op  B) 

holds  for  any  machine-precision  values  A  and  B  and  for  each  of  the  four  basic 
operations  +,  — ,  •,  and  /. 
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Given,  safe  numbers  A  and  B,  and  op  in  *—,*,/},  let  k  be  the  unique  integer  such 
that 

2k  <  |A  op  B|  <  2k+1. 

Then,  with  the  notation  just  introduced,  the  Ada  model  guarantees  that 

|£(A  op  B)|  <  2k  e. 

We  are  now  ready  to  perform  the  analysis. 

6.1  Classification  of  Errors 

We  can  classify  the  errors  in  our  algorithm  into  three  categories: 

•  Error  in  reduction.  The  computed  reduced  argument  Rj  +  R2  differs  from  the  cor¬ 
rect  one  r  defined  by  the  equation 

x  =  (32 m  +  j)  log  2/32  +  r. 

•  Error  in  approximation.  The  approximating  polynomial  or  rational  function  p(r) 
differs  from  exp(r)  —  1. 

•  Rounding  errors.  Errors  will  be  committed  as  we  compute  p{r)  and  the  final 
reconstruction;  such  is  the  nature  of  finite-precision  arithmetic  on  computers. 

Our  analysis  treats  each  of  the  three  categories  of  error  independently  before  com¬ 
bining  them. 

6.2  Error  in  Reduction 

Let  Ri,R2,N,Ni,N2,M,  and  J  be  the  values  as  obtained  in  step  2  of  the  implementation. 
We  estimate  the  difference  between  the  value  Ri  +  R2  and  the  correct  reduced  argument 
r,  where 


r  =  Y  —  N  •  log  2/32. 

Several  observations: 

•  Li,L2  are  so  chosen  (see  the  source  code  for  details)  that 

|Li  +  L2  -  log2/32|  <  2-10  •  2-exP°nent  width  .  e  and 
|L2|  <  2-9  •  2-exPonent  width 

•  Both  NjLi  and  N2Li  are  safe  numbers  with  at  least  one  trailing  zero.  This  is 
because  Li  is  a  safe  number  with  at  least  nine  trailing  zeros  and  both  Ni  and  N2 
never  exceed  eight  bits  by  design. 
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•  Consider  the  calculation  of  Ri : 

TMP  :=  Ni  *Li; 
if  |Y|  >  |TMP|  then 
Rl  :=  Y  -  TMP; 
otherwise, 

Rl  :=  (Y  -  TMP/2)  -  TMP/2; 
if  N2  ^  0,  Ri  :=  Ri  —  N2  *  Li . 

The  crucial  observation  is  that  the  calculations  above  are  error  free  when  performed 
on  any  hexadecimal  machine  with  a  guard  digit  for  subtraction,  and  on  any  binary 
machines  with  or  without  a  guard  bit  for  subtraction.  The  reason  is  that  both 
N1L1  and  N2L1  have  at  least  one  trailing  zero  bit,  and  cancellation  occurs  in  each 
of  the  subtractions  above. 

Using  these  observations,  we  can  estimate  the  error  in  reduction  as  follows: 


R1+R2  =  Y-(N-Li  +  (N-L2» 

=  Y-N(Li+L2)  +  £(N-L2). 


Now,  because 

|N|  <  |Y  •  32/  log  2| 

<  |LARGE_THRESHOLD  •  32/  log  2| 

<  |64  •  2exPonent  width"1 .  log2(radix)| 

<  log2(radix).2exP°nentwidth+5, 

therefore 

|N  •  L2|  <  2-6  and  |£(N  •  L2)|  <  2~7c. 

Consequently, 

|(Rl+R2)-(Y-N -log 2/32)1  <  N  •  |Li  +  L2  —  log2/32|  +  |£(N  •  L2)| 

<  2_7e  +  2_7c 

<  2-6€. 

6.3  Error  in  Approximation 

We  estimate  the  difference  between  the  transcendental  function  e‘  -  1  and  the  approxi¬ 
mating  polynomial 

p(t)  =  t  +  Ait2  + - 1-  Antn+1 

for  t  6  [-log 2/64, log 2/64].  The  estimation  is  done  by  locating  numerically  all  the 
extreme  points  of  e*  -  1  -  p{t)  in  the  interval  [-0.010831,0.010831]  (slightly  wider  than 
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[-  log  2/64,  log  2/64]).  In  our  code,  five  different  polynomials  are  used  for  different  ranges 
of  FLOAT-TYPE* DIGITS.  In  each  of  those  ranges,  we  find  that 

|e*  —  1  —  p(t)|  <  2-6e 

for  all  t  G  [—0.010831,0.010831],  where  e  corresponds  to  the  maximum  number  of  digits 
in  that  particular  range. 

6.4  Rounding  Errors 

Here  we  are  concerned  with  the  difference  between  the  value  EXP  obtained  by  executing 
R  !=  Ri  +  R2 

POLY  :=  R*R*(Ai  +  R*(A2  +  R*  (. . .  +  R *  An) .. .)) 

Q  :=  Ri  +  (R2  +  LONGEST-FLOAT(POLY)) 

F  :=  TW0_T0_J_BY_32(J,  LEAD) +  TW0-T0_J-BY_32(J,  TRAIL) 
qt  :=  TW0_T0_J_BY_32(  J,  LEAD) 

Q2  :=  TW0-T0.J_BY_32(J,  TRAIL)  +  F  *  Q 
EXP  :=  2M  *  (Qi  +  Q2)  for  binary  machines 
EXP  :=  2*  *  Qi  +  211  *  Q2  for  hexadecimal  machines 

and  the  value  we  would  have  obtained  had  all  the  preceding  calculations  been  error  free. 
Three  observations  simplify  our  analysis.  First,  on  binary  machines,  the  execution  of 

2"*(Qi  +Q2) 

and 

2s  *  Qi  +  2m  *  Q2 

yields  identical  results.  Second,  the  magnitude  of  POLY  is  at  most  f(log2/64)2,  which 
is  less  that  2-13.  Thus  the  rounding  errors  accumulated  in  POLY  are  practically  zero. 
Third,  2**  *  Qi  is  exact  because  Qj  is  a  safe  number. 

To  shorten  the  exposition  that  follows,  we  use  Sj  and  S2  to  denote 

TW0_T0_JJBY_32(  J ,  LEAD)  and  TW0_T0_J_BY_32(J,  TRAIL), 

respectively.  We  axe  now  ready  to  begin. 

Using  the  observations  and  the  notation,  we  are  going  to  estimate  the  difference 
between 

(2"  •  Qi  +  (2"  •  (S2  +  ((Sj  +  S2)  •  (Ri  +  (R2  +  POLY)))))) 

and 

2"  •  Qi  +  (2"  •  (S2  +  ((Si  +  S2)  •  (Ri  +  (R2  +  POLY))))). 

To  obtain  a  good  estimate,  we  must  give  a  careful  account  for  each  deviation  of  our 
computed  value  from  the  ideal  one.  We  use  E0  to  denote  the  ideal  result.  Ex  denotes 
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the  first  corrupted  result,  E2  the  second,  and  so  on.  E7  is  the  final  computed  result,  and 
the  rounding  error  is  simply  the  difference  Eq  —  E7. 


E0  :=  2«  •  Qi  +  (2”  •  (S2  +  ((Si  +  S2)  •  (Ri  +  (R2  +  POLY))))) 

£x  :=  2“  •  Qi  +  214  •  (S2  +  (Sj  +  S2)  •  (Ri  +  (R2  +  POLY))) 

E2  :=  2"  •  Qi  +  2M  •  (S2  +  (Si  +  S2)  •  (Ri  +  (R2  +  POLY))) 

£3  :=  2”  •  Qi  +  2H  •  (S2  +  (Si  +  S2)  •  (Ri  +  (R2  +  POLY))) 

£4  ==  2m-Qi  +  2m-(S2  +  ((Si  +  S2)-(Ri  +  (R2+P0LY)))) 

£5  :=  2"  •  Qi  +  2”  •  (S2  +  ((Si  +  S2)  •  (Ri  +  (R2  +  POLY)))) 

Eq  :=  2”  •  Qi  +  (2“  •  (S2  +  ((Si  +  S2)  •  (Ri  +  (R2  +  POLY))))) 

E7  :=  (2h  •  Qi  +  (2”  •  (S2  +  ((Si  +  S2)  •  (Rj  +  (R2  +  POLY)))))) 

We  also  name  the  following  values  by  Fi,F2,  ...  ,£7  because  these  values  arise  often 
in  what  follows. 

Fi  :=  R2  +  POLY 

F2  :=  Ri  +  (R2  +  P0LY) 

Fz  :=  Si+S2 

£4  :=  (Si  +  S2)  •  (Ri  +  (R2  +  POLY)) 

£5  :=  S2  +  ((Si  +  S2)-(Ri  +  (R2+P0LY))) 

Fq  :=  2M  •  (S2  +  ((Si  +  S2)  •  (Ri  +  (R2  +  POLY)))) 

£7  :=  2m-Qi  +  (2m-(S2  +  ((Si  +  S2)-(Ri  +  (R2+P0LY))))) 

Now  the  estimates: 


|  rounding  errors  |  =  |£q  —  £7!  <  ^  |£f-i  —  £»|, 


|£0-£i|  =  2M|Si  +  S2|.|(R2  +  P0LY)-(R2  +  P0LY)| 

=  2m|£3|  •  |f(£i)|, 

|£i-£2|  =  2M|£3h|e(£2)l, 

|£2-£3|  =  2M|(£2)|  •  \t(Fz)\, 

\Ez  -  £4|  =  2M|£(£4)|, 

|£4  -  £5 1  =  2m|£(£5)|, 

|£5-£6|  =  K(£6)|, 

|£6  -  £7|  =  |£(£r)|. 
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To  get  an  estimate  of  |£(l*j)|  for  j  —  1, . . . ,  7,  we  need  know  only  the  rightmost  binary 
intervals  in  which  the  various  \Fj\'s  may  lie.  Note  that  each  of  the  Fj’ s  is  the  computed 
result  of  some  value  whose  range  is  known.  Consequently,  unless  the  largest  magnitude 
achieved  by  those  values  lies  very  close  to  a  power  of  2,  the  rightmost  binary  intervals 
in  which  those  values  may  lie  are  the  binary  intervals  we  seek.  We  tabulate  the  results 
below. 


Value 

Range 

Conclusion  Drawn 

IN 

[0, 2-10-78] 

l£(*i)|  <  2~ne 

W»0I 

[0, 2-6-52] 

l£(*i»)|  <  2-7e,  |F2|  <  2-6,5 

|2i/32| 

[0, 231/32] 

MF9)\  =  0  for  j  =  0; 

|2J/32p(r)| 

2i/32[o,  2-6.52] 

k(^3)|  <  f  otherwise. 

\£(F4)\  <  2~7e  for  j  =  0; 

|S2  +  2J/32p(r)| 

2j/32[0,  2-6-49] 

\£(F4)\  <  2-6c  otherwise. 

<  2-7e  for  j  =  0; 

2m|S2  +  2jf/32p(r)| 

lf(^5)|  <  2-6e  otherwise. 

2M2i/32[Q)  2—6.49] 

l^6)|<2M2-7cfory  =  0; 

|2M2i/32er| 

lf(^6)|  <  2M2-6e  otherwise. 

2A/2j/32[2-l/64>2l/64] 

|e(FV)|  <  2m~1€  for  2J/32er  <  1; 

\am  <  2Me  otherwise. 

Thus,  when  j  =  0, 


|  rounding  error  |  < 

< 

When  j  =  1, 

|rounding  error|  < 
< 

Note  also  that 
and 


l«*r)|  +  2M~1  •  c  •  (2~10  +  2"6  +  0  +  2“6  +  2"6  +  2~6) , 
KWi  +  2 •  c  •  0.06348. 

\t(F7)\  +  2m  ■  €  •  (2-10  +  2-6  +  2-6-5  +  2-6  +  2“6  +  2~6), 
te(*V)|  +  2M  •  c  •  0.07453. 


eY  =  2m  •  2i/32  •  er, 
\{(F7)\/(2M-2j'32-er)  <  e. 


6.5  Overall  Error 

Finally,  we  estimate  the  overall  relative  error 

|eY  -  (2*  •  Qi  +  (2"  •  (S2  +  ((Si  +  S2)  •  (Rj  +  (R2  +  P0LY»))))|  j  eY. 
From  the  previous  analysis, 
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[absolute  error | 

=  |2m2^3V-{2m-Q1  +  (2m-(S2  +  ((S1+S2)-(Ri+(R2+P0LY))))))| 

<  2m2j/32  •  \er  -  eRi+R2 1  +  2M2J’/32  •  |eRl+R2  -  1  -  p( Ri  +  R2)| 

+  |2M2J/32p(Ri  +  R2)  +  2m2j/32 

— <2m  •  Qj  +  (2*  •  <S2  +  ((Si  +  S2)  •  (Ri  +  <R2  +  P0LY)))»)| 

<  2m2j'/32  fl.01|r  -  (Ri  +  R2)|  +  |eRi+R2  -  1  -  p(Ri  +  R2)|) 

+  |2M2J/32p(Ri  +  R2)  +  2m2j/32 

!  —  (2m  •  Qi  +  (2*  •  (S2  +  ((Si  +  S2)  •  (Ri  +  (R2  +  P0LY>)»»| 

<  2m2j/32  (1.01|error  in  reduction!  +  |error  in  approximation!) 

i  +  |  rounding  error  | 

<  2m2j/32(1.01  •  2-6  +  2-6)c  +  |rounding  error|. 

When  j  =  0,  eY  >  2M"1  and 
|  relative  error  | 

<  2  (1.01  •  2~6  +  2~6)  e  +  0.06348e  +  (\Z(F7)\  /  eY) 

<  1.13e. 

When  j  >  1,  e1  >  2M  and 
[relative  error | 

<  2 31/32  (1.01  •  2-6  +  2-6)  e  +  0.07453e  +  (| £(JV)|  /  eY) 

<  1.14c. 

Thus,  the  relative  error  of  the  implementation  stays  well  within  the  required  threshold 
of  4e. 

6.6  Remarks 

On  all  Ada  systems  that  we  have  experience  with,  the  implementation  is  actually  capable 
►  of  delivering  comparable  accuracy  with  respect  to  the  precision  offered  by  the  underlying 

hardware.  Moreover,  on  machines  such  as  the  VAX  or  those  with  floating-point  arith¬ 
metic  conforming  to  ANSI/IEEE  Standard  754-1985,  the  previous  analysis  is  pessimistic. 
In  particular,  a  similar  implementation  that  is  tailored  specifically  to  IEEE  754  arith¬ 
metic  has  been  proved  accurate  to  within  0.54  unit  of  last  place.  In  a  later  paper,  we 
will  analyze  in  detail  the  accuracy  of  our  implementation  on  the  various  machines  we 
are  interested  in. 

7  Test  Results 

The  code  as  listed  in  the  Appendix  has  been  run  on  a  Sequent  VADS  compiler  version 
5.41.6,  an  IBM  PC/AT  using  the  Meridian  AdaVantage  compiler  version  2.0,  and  a  VAX 
8650  using  DEC  Ada  version  1.4. 
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We  have  also  tested  the  implementation  on  the  Sequent  and  the  IBM  PC/AT  using 
the  ELEFUNT  test  transcribed  into  Ada  by  K.  W.  Dritz. 

On  the  Sequent,  there  are  two  predefined  floating-point  types  with  24  and  53  signifi¬ 
cant  bits,  ’digits  6  and  15,  respectively.  Thus  the  accuracies  offered  by  the  two  sets  of 
safe  numbers  are  21  and  51  bits,  respectively;  and  those  offered  by  the  two  machine  for¬ 
mats  are  24  and  53  bits,  respectively.  The  ELEFUNT  results  are  summarized  as  follows. 
(For  a  description  of  the  test,  see  [3].) 


Generic 

Accuracy 

Reported  Loss  of  Binary  Bits 

Actual  Type 

with  respect  to 

Max.  Relative  Error 

Root  Mean  Square 

’digits  6 

21  bits 

6.00 

0.00 

’digits  6 

24  bits 

0.99 

0.00 

’digits  15 

51  bits 

0.00 

0.00 

’digits  15 

53  bits 

0.99 

0.00 

On  the  IBM  PC/AT  with  the  Meridian  AdaVantage,  there  is  only  one  predefined 
floating-point  type  with  53  significant  bits,  ’digits  15.  Thus,  the  accuracies  offered 
by  the  safe  numbers  and  the  machine  format  are  51  bits  and  53  bits,  respectively.  The 
results  are  summarized  as  follows. 


Generic 

Accuracy 

Reported  Loss  of  Binary  Bits 

Actual  Type 

with  respect  to 

Max.  Relative  Error 

Root  Mean  Square 

’digits  15 

51  bits 

0.00 

0.00 

’digits  15 

53  bits 

1.00 

0.00 

8  Conclusion  and  Future  Work 

We  have  shown  that  the  environmental  inquiries  and  other  numerical  features  provided 
by  Ada  make  portability  and  provability  of  some  numerical  software  possible.  With 
conscientious  effort,  a  reasonably  portable  and  accurate  exponential  function  can  be 
implemented. 

Our  experience  with  the  sample  implementation  presented  here  strongly  suggests 
that  the  following  four  projects  are  within  reach.  We  are  committed  to  the  first  two,  and 
we  hope  that  circumstances  will  allow  us  to  pursue  the  others. 

•  Specification  of  Elementary  Functions:  The  sample  implementation  has  pro¬ 
vided  us  with  valuable  guidelines  on  proposing  a  specification  for  the  elementary 
function  library.  We  will  continue  to  participate  in  the  formulation  of  the  specifi¬ 
cation. 

•  Library  of  Elementary  Functions:  A  portable  complete  library  of  the  twenty 
elementary  functions  ([7]  and  [6])  can  be  implemented  by  using  strategies  similar 
to  those  employed  here.  The  only  technical  challenge  we  foresee  now  is  an  accurate 
reduction  routine  that  finds  the  remainder  of  a  machine  number  with  respect  to 
the  transcendental  number  w. 
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•  Library  of  Primitive  Functions:  We  intend  to  construct  a  library  of  primitive 
functions  similar  to  those  proposed  in  [6].  Since  the  functions  here  are  of  a  much 
lower  level,  a  portable  implementation  may  not  be  practical  or  even  possible  in 
some  cases.  Some  of  the  elementary  functions  may  be  constructed  on  top  of  this 
basic  library.  (Our  exponential  function,  though  not  dependent  on  this  library,  will 
benefit  from  it.) 

•  Validation:  From  our  experience  so  far,  a  portable  test  suite  seems  to  be  im- 
plementable.  The  test  suite  we  have  in  mind  consists  of  two  parts.  The  first  part 
basically  will  be  a  transcription  of  the  ELEFUNT  test  in  [3].  The  ELEFUNT  test 
is  adept  in  reporting  possible  mistakes  and  in  estimating  the  accuracy  of  the  func¬ 
tion  under  test.  The  second  part  of  the  test  will  try  to  report  the  exact  deviation  of 
the  function  under  test  from  the  correct  value.  In  the  past,  such  a  task  has  usually 
been  performed  only  if  higher-precision  function  values  are  available.  With  the 
numerical  features  of  Ada  —  for  example,  accurate  conversion  of  universal  reals  to 
model  numbers,  and  table-driven  techniques  like  those  employed  in  our  exponential 
function  —  we  believe  such  a  task  can  be  accomplished  portably  even  without  an 
extra-precise  function. 

A  test  suite  as  such  should  also  be  useful  in  validating  libraries  that  claim  confor¬ 
mance  to  the  proposed  specification. 
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Appendix 

The  following  is  the  complete  source  program  for  the  exponential  program, 
package  MATHEMATICAL_EXCEPTIONS  is 
ARGUMENT_ERROR  :  exception; 
end  MATHEMATICAL_EXCEPTIONS ; 


with  MATHEMATICAL.EXCEPTIONS; 
generic 

type  FLOAT.TYPE  is  digits  <>; 
package  GENERIC_ELEMENTARY_FUHCTIONS  is 

function  EXP(  X  :  FLOAT.TYPE  )  return  FLOAT_TYPE; 

—  other  functions  to  he  added  later 

ARGUMENT.ERROR  :  exception  renames  MATHEMATICAL_EXCEPTIONS . ARGUMENT_ERROR; 
end  GENERIC_ELEMENTARY_FUNCTIONS ; 


with  SYSTEM; 

with  TEXT_IO;  use  TEXT_IO; 

package  body  GENERIC .ELEMENT ARY _FUN CTIONS  is 

As  of  2/4/88,  this  package  contains  only  the  exponential 
function.  More  functions  will  be  added  later. 

—  FLOAT.TYPE  is  the  floating-point  type  with  which  the  user 
instantiates  this  package.  Computation  in  this  type  is 

--  avoided  to  insulate  ourselves  from  any  possible  range 
constraints  imported  with  the  type. 

Two  floating-point  types  are  defined  in  this  package  body: 

—  LONGEST_FLOAT  is  the  floating-point  type  having  'DIGITS  equal  to 
SYSTEM . MAX_DIGITS .  This  type  is  needed  here  to  perform 
argument  reductions  and  final  reconstructions  of  elementary 
function  values  in  the  maximum  precision  available. 
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—  WORKING_FLOAT  is  the  floating-point  type  in  which  the  approximations 

—  for  elementary  functions  are  carried  out.  This  type  is  so  defined 
~  that 

WORKING.FLOAT' DIGITS  =  FLOAT_TYPE  *  BASE , DIGITS 

—  on  all  the  Ada  systems  we  have  experience  with. 

—  However,  there  may  be  some  (unknown  to  us)  systems  for  which 

W0RKINGJFX0AT, DIGITS  >  FLO AT .TYPE' BASE' DIGITS 

—  Thus,  type  WORKING_FLOAT  has  at  least  the  precision  of  the  base 

—  type  of  FLOAT_TYPE,  and  usually  it  does  not  have  excess  precision. 


—  Assumptions: 

—  This  package  body  is  portable  to  a  particular  implementation  only 

—  if  the  following  assumptions  are  valid  in  that  implementation: 

—  (1)  6  <=  SYSTEM . MAX JDIGITS  <=  33 

—  (2)  The  following  assumptions  sure  made  on  floating-point 

arithmetic: 

(a)  Radix:  The  radix  will  be  either  2  or  16. 

(b)  Exponent  Width:  We  assume  that  the  number  of  bits  in 
the  exponent  field  of  the  floating-point  format  never 
exceeds  L/3,  where  L  is  the  actual  number  of  bits 

in  the  mantissa  of  the  machine. 

(c)  Arithmetic:  Let  A  and  B  be  two  machine  numbers 

such  that  2B  >=  A  >=  B.  Then,  cancellation  occurs  in 
A  -  B.  On  binary  machines,  we  assume  that  the 
subtraction  is  exact  whenever  B  has  one  (or  more) 
trailing  zero  bit(s).  On  nonbinary  machines,  we 
assume  that  A  -  B  is  exact.  This  assumption  requires 
in  particular  that  a  guard  digit  be  present  in  the 
subtraction  hardware  for  the  nonbinary  machines. 

—  If  the  assumptions  (1)  and  (2a)  are  invalid,  the  predefined 

—  exception  PROGRAM.ERROR  will  be  raised. 

—  Some  compilers  could  detect  at  compile  time  that  it 

—  must  always  be  raised  at  run  time,  thus  calling  attention  to  the 

—  violation  of  the  assumptions  at  compile  time. 


type  L0NGEST_FL0AT  is  digits  SYSTEM . MAX_DIGITS ; 
type  LONGEST.INTEGER  is  range  SYSTEM . MIN_INT . . SYSTEM . MAX_INT ; 
type  POSITION  is  (LEAD,  TRAIL); 
subtype  INDEX  is  LONGEST JENTEGER  range  0..31; 

—  The  following  tables  of  constants  are  needed  by  several  of  the  elementary 
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—  functions. 


—  TW0_T0_J_BY_32  is  an  array  of  32  pairs  of  LONGESTJFLOAT  numbers 
--  representing  2**(j/32)  for  j  =0,  1,  2,  ....  31.  Each  such  value 

is  represented  by  LEAD  +  TRAIL.  The  leading  parts  contain  13  bits 
o*  information  and  are  consequently  model  numbers  as  long  as 
SYSTEM. MAX _DIGITS  is  >=  4.  The  trailing  parts  contain  roughly 
LON GEST_FL0 AT ’ MANTISSA  bits  of  information  under  the  assumption 
“  that  SYSTEM. MAX _DIGITS  is  <=  35.  So,  when  the  assumptions  are  met, 

—  LEAD  +  TRAIL  represents  2**(j/32)  to  roughly  13  extra  bits. 

TW0_T0_J_BY_32  :  constant  array (  INDEX,  POSITION  )  of  L0NGEST_FL0AT  :=  ( 

0  =>  (LEAD  s>16#1.000#,  TRAIL  =>16#0. 000000000000000000000000000000000#) , 

1  =>  (LEAD  =>16#1 .059#,  TRAIL  =>16#0.000B0D31585743AE7C548EB68CA417FE5#) , 

2  =>  (LEAD  =>16#1.0B5#,  TRAIL  »>16#0 .000586CF9890F6298B92B71842A983642#) , 

3  =>  (LEAD  =>16#1 .113#,  TRAIL  =>16#0.00001D0125B50A4EBBF1AED9318CEAC5C#) , 

4  =>  (LEAD  =>16#1 . 172#,  TRAIL  =>16#0.000B83C7D517ADCDF7C8C50EB14A79203#) , 

5  =>  (LEAD  =>16#1 . 1D4#,  TRAIL  =>16#0.000873168B9AA7805B8028990F07A98B4#) , 

6  =>  (LEAD  S>16#1 .238#,  TRAIL  =>16#0.0007A6E7S623866C1FADB1C15CB593B03#),, 

7  =>  (LEAD  =>16#1.29E#,  TRAIL  =>16#0.0009DF51FDEE12C25D15F5A24AA3BCA88#) , 

8  =>  (LEAD  =>16#1.306#,  TRAIL  =>16#0.000FE0A31B7152DE8D5A46305C85EDECB#) , 

9  =>  (LEAD  =>16#1 .371#,  TRAIL  =>16#0.000A7373AA9CAA7145502F4547987E3E1#) , 

10  =>  (LEAD  =>16#1.3DE#,  TRAIL  =>16#O.OOOA64C12342235B41223E13D773FBA2C#), 

11  =>  (LEAD  =>16#1.44E#,  TRAIL  =>16#0.000086061892D03136F409DF019FBD4F3#) , 

12  =>  (LEAD  =>16#1 .4BF#,  TRAIL  =>16#0 . 000DAD5362A271D4397AFEC42E20E0363#) , 

13  =>  (LEAD  =>16#1.534#,  TRAIL  =>16#0.0002B569D4F81DF0A83C49D86A63F4E67#) , 

14  =>  (LEAD  =>16#1.5AB#,  TRAIL  =>16#0.00007DD48542958C93015191EB345D88D#), 

15  =>  (LEAD  =>16#1.624#,  TRAIL  =>16#0.0007EB03A5584B1F0FA06FD2DA42BB1CE#), 

16  =>  (LEAD  =>16#1.6A0#,  TRAIL  =>16#0.0009E667F3BCC908B2FB1366EA957D3E3#), 

17  =>  (LEAD  =>16#1.71F#,  TRAIL  =>16#0.00075E8EC5F73DD2370F2EF0ACD6CB434#) ,’ 

18  =>  (LEAD  =>16#1.7A1#,  TRAIL  =>16#0.0001473EB0186D7D51023F6CDA1F5EF42#), 

19  =>  (LEAD  =>16#1 .825#,  TRAIL  =>16#0.00089994CCE128ACF88AFAB34A010F6AD#), 

20  =>  (LEAD  =>16#1.8AC#,  TRAIL  =>16#0.000E5422AA0DB5BA7C55A192C9BB3E6ED#),* 

21  =>  (LEAD  =>16#1 .937#,  TRAIL  =>16#0.00037B0CDC5E4F4501C3F2540A22D2FC4#) 

22  =>  (LEAD  =>16#1.9C4#,  TRAIL  =>16#0.0009182A3F0901C7C46B071F2BE58DDAD#) , 

23  =>  (LEAD  =>16#1 .ASS#,  TRAIL  =>16#0.00003B23E255C8B424491CAF87BC8050A#), 

24  =>  (LEAD  =>16#1.AE8#,  TRAIL  =>16#0.0009F995AD3AD5E8734D1773205A7FBC3#) , 

25  =>  (LEAD  =>16#1.B7F#,  TRAIL  =>16#0.00076F2FB5E46EAA7B081AB53C5354C88#), 

26  =>  (LEAD  =>16#1.C19#,  TRAIL  =>16#0 . 0009BDD85529C2220CB12A091BA667944#) ,’ 

27  =>  (LEAD  =>16#1.CB7#,  TRAIL  =>16#0.00020DCEF90691503CBD1E949DB761D9S#),, 

28  =>  (LEAD  =>16#1.D58#,  TRAIL  =>16#0.00018DCFBA48725DA05AEB66E0DCA9F58#), 

29  =>  (LEAD  =>16#1.DFC#,  TRAIL  =>16#0.00097337B9B5EB968CAC39ED291B7225A#), 

30  =>  (LEAD  =>16#1.EA4#,  TRAIL  =>16#0 . 000AFA2A490D9858F73A18F5DB301F86D#) , 

31  =>  (LEAD  =>16#1.F50#,  TRAIL  =>16#0.000765B6E4540674F84B762862BAFF98F#) * ) ; 


-  SCALE  is  a  primitive  function  that  returns  a  value  scaled  by  a 
specified  power  of  two.  The  following  implementation  of 
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—  SCALE  is  temporary,  awaiting  a  package  (possibly  nonportable) 

—  of  primitive  functions  implemented  in  the  most  effective  way 

—  possible  for  a  given  machine. 


function  SCALE  (  X  :  LONGEST_FLOAT;  M  :  LONGEST.INTEGER  ) 
return  LONGEST.FLOAT  is 

FACTOR  :  LONGEST.FLOAT; 

POWERS  :  LONGEST  JFLQAT  :=  1.0; 

MULTIPLIER  :  LQNGEST_FLOAT  :=  1.0; 

I,  J  :  LONGEST_INTEGER ; 

begin 

if  M  >  0  then 
FACTOR  2.0; 

I  :=  M; 
else 

FACTOR  0.5; 

I  :=  -M; 
end  if; 

POWERS  :=  FACTOR; 
while  I  >=  2  loop 
J  :=  I  mod  2; 

I  :=  I  /  2; 
if  J  =  1  then 

MULTIPLIER  :=  MULTIPLIER  *  POWERS; 
end  if; 

POWERS  :=  POWERS  *  POWERS; 
end  loop ; 
if  M  /=  0  then 

MULTIPLIER  :=  MULTIPLIER  *  POWERS; 
end  if; 

return  (  MULTIPLIER*!  ); 
end  SCALE; 

function  EXP(  X  :  FLOATJTYPE  )  return  FLOATJTYPE  is  separate; 
—  other  functions  to  be  added  later 
end  GENERIC _ELEMENTARY_FUNCTIONS ; 


separate  (  GENERIC_ELEMENTARY_FUNCTIONS  ) 
function  EXP(  X  :  FLOATJTYPE  )  return  FLOATJTYPE  is 
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RESULT  :  FLOAT .TYPE; 

L0G2_BY_32_LEAD ,  L0G2_BY_32_TRAIL ,  F, 

Y,  Rl,  R2,  Q,  TMP  :  LONGEST.FLOAT; 

TW0_T0_6  :  constant  :=  64; 

TW0_T0_8  :  constant  :=  256; 

L0G2_BY_32  :  constant  := 

16#5.8B90B_FBE8E_7BCD5_E4FlD_9CC01_F97B5_7A079_A1933_94C#E-2; 
THIRTY _TW0_BY_L0G2  :  constant  := 

16#2E . 2A8EC_A5705_FC2EE_FA 1FF_B41A4_74FA2_3AD5E# ; 
LARGE.THRESHOLD  :  LOH GEST.FLO AT  :  = 

2.0  *  LONGEST _FLOAT(FLOAT_TYPE ’ SAFE.EMAX)  *  6.931471806E-1; 
SMALL.THRESHOLD  :  LONGEST.FLOAT  := 

FLOAT.TYPE * BASE  *  EPSILON ; 

I,  I,  N_1 ,  N_2,  M,  J  :  LONGEST.INTEGER ; 

begin 


—  Step  1.  Filter  out  exceptional  cases. 

Y  :=  LONGEST.FLOAT (  X  ); 
if  abs(Y)  >=  LARGE.THRESHOLD  then 
raise  ARGUMENT.ERROR ; 
elsif  abs(Y)  <=  SMALL.THRESHOLD  then 
return(  FLOAT_TYPE(  1.0  +  Y  )  ); 
end  if; 


—  Step  2.  Reduce  the  argument. 

—  To  perform  argument  reduction,  we  find  the  integer  N  such  that 

X  =  N  *  log2/32  +  remainder,  I  remainder I  <=  log2/64. 

—  I  is  defined  by  round-to-nearest-integer(  X*32/log2  )  and 

—  remainder  by  X  -  N*log2/32.  The  calculation  of  N  is 

--  straightforward  whereas  the  computation  of  X  -  N*log2/32 

—  must  be  carried  out  carefully. 

—  log2/32  is  so  represented  in  two  pieces  that 

(1)  log2/32  is  known  to  extra  precision,  (2)  the  product 
of  N  and  the  leading  piece  is  a  model  number  and  is  hence 

—  calculated  without  error,  and  (3)  the  subtraction  of  the  value 
obtained  in  (2)  from  X  is  a  model  number  and  is  hence  again  obtained 

—  without  error. 

The  following  case  analysis  chooses  the  appropriate 
representation  of  log2/32,  depending  on  the  number  of 

—  digits  in  LON GEST.FLO AT . 
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case  SYSTEM. MAX.DIGITS  is 


when  6  => 

L0G2_BY_32_LEAD  :  =  16#5.8B8#E-2;  — 12  bits 

L0G2_BY_32_TRAIL  :=  L0G2_BY_32  -  16#5 . 8B8#E-2 ; 

when  7 . .  8  => 

L0G2_BY_32_LEAD  :=  16#5.8B9#E-2;  —15  bits 

L0G2_BY_32_TRAIL  :=  L0G2_BY_32  -  16#5.8B9#E-2; 

when  9. .11  => 

L0G2_BY_32_LEAD  :=  16#5.8B908#E-2;  —17  bits 

L0G2_BY_32_TRAIL  :=  L0G2_BY_32  -  16#5 . 8B908#E-2 ; 

when  12.. 14  => 

L0G2_BY_32_LEAD  :=  16#5.8B90A#E-2;  —22  bits 

L0G2_BY_32 .TRAIL  :=  L0G2_BY_32  -  16#5.8B90A#E-2; 

when  15.. 19  => 

L0G2_BY_32_LEAD  :=  16#5.8B90B_F8#E-2;  —28  bits 

L0G2_BY_32_TRAIL  :=  L0G2_BY_32  -  16#5 . 8B90B_F8#E-2 ; 


vhen  20 . . 27  => 

L0G2_BY_32_LEAD  :=  16#5.8B90B_FBE8#E-2;  —37  bits 

L0G2_BY_32_TRAIL  :=  L0G2_BY_32  -  16#5.8B90B_FBE8#E-2; 


when  28.. 33  => 

L0G2_BY_32_LEAD  :=  16#5.8B90B_FBE8E_7A#E-2;  —50  bits 

L0G2_BY_32_TRAIL  :=  L0G2_BY_32  -  16#5 . 8B90B_FBE8E_7A#E-2 ; 

when  others  => 

raise  PR0GRAM_ERR0R ;  —  assumption  (1)  is  violated, 

end  case; 

—  Perform  argument  reduction  in  L0NGEST_FL0AT. 

H  :=  LONGEST_INTEGER(  Y  *  THIRTY _TW0_BY_L0G2  ); 
if  abs(N)  >=  TW0_T0_8  then 
H_2  :=  H  mod  TH0_T0_6; 
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NJL  :=  N  -  N_2; 
else 

N-2  :=  0; 

N_1  :=  N; 
end  if ; 

TMP  :=  L0NGEST-FL0AT(  1.1  )  *  L0G2-B Y-32-LEAD ; 
if  abs(  Y  )  >=  abs(  TMP  )  then 
R1  :=  Y  -  TMP; 
else 

TMP  :=  0.5  *  TMP; 

Rl  :=  (Y  -  TMP)  -  TMP; 
end  if; 

if  N_2  /=  0  then 

Rl  :=  Rl  -  L0NGEST-FL0 AT (  N-2  )  *  L0G2-BY.32JLEAD ; 
end  if; 

R2  :=  -L0NGEST-FL0 AT ( N )  *  L0G2.BY-32-TRAIL ; 

J  :=  N  mod  32; 

M  :=  (H  -  J)/32; 

—  Step  3.  Approximation. 

—  The  following  is  the  core  approximation.  We  approximate 

--  exp(Rl+R2)-l  by  a  polynomial.  The  case  analysis  finds  both 

—  a  suitable  floating-point  type  (less  expensive  to  use  than 
L0NGEST_FL0AT)  and  an  appropriate  polynomial  approximation 

—  that  will  deliver  a  result  accurate  enough  with  respect  to 
FLOAT-TYPE* BASE9 DIGITS.  Note  that  the  upper  bounds  of  the 

—  cases  below  (6,  15,  16,  18,  27,  and  33)  are  attributes 

—  of  predefined  floating  types  of  common  systems. 

case  FLOAT-TYPE* BASE* DIGITS  is 
when  1 . . 6  => 
declare 

type  WORKING-FLOAT  is  digits  6; 

R,  POLY  :  VORKING.FLOAT; 
begin 

R  :=  WORKING-FLOAT (  Rl  +  R2  ) ; 

POLY  R*R*(  5. 00004-0481E-01  +  R  *  1 .66667.6443E-01  ); 
Q  :=  Rl  +  (  R2  +  L0NGEST-FL0 AT (  POLY  )  ); 
end; 

when  7.. 15  => 
declare 

type  WORKING-FLOAT  is  digits 

( 15+SYSTEM . MAX_DIGITS  -  abs( 15-SYSTEM. MAX_DIGITS))/2; 
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—  this  is  min(  15,  SYSTEM. MAX.DIGITS  ) 

R,  POLY  :  WORKING .FLOAT; 
begin 

R  :=  WORKING_FLOAT(  R1  +  R2  ); 

POLY  :=  R*R*(  5. 00000.00000 _00000_08883E-01  + 

R*(  1. 66666 .66666.52608.78863E-01  + 

R*(  4.16666.66666.22607.95726E-02  + 

R*(  8. 33336.79843.42 196.16221E-03  + 

R*(  1. 38889 .49086.37771.99667E-03  ))))); 

Q  :=  R1  +  (  R2  +  LONGEST.FLOAT (  POLY  )  ); 
end; 

when  16  => 
declare 

type  WORKING.FLOAT  is  digits 

( 16+SYSTEM . MAX.DIGITS  -  abs(16-SYSTEM.MAX_DIGITS))/2; 

R,  POLY  :  WORKING.FLOAT ; 
begin 

R  :=  WORKING.FLOAT (  R1  +  R2  ) ; 

POLY  :=  R*R*(  5. 00000.00000 .00000.08883E-01  + 

R*(  1. 66666 .66666.52608.78863E-01  + 

R*(  4. 16666 .66666.22607.95726E-02  + 

R*(  8.33336.79843.42196.16221E-03  + 

R*(  1 . 38889.49086.37771.99667E-03  ))))); 

Q  :=  R1  +  (  R2  +  LONGEST.FLOAT (  POLY  )  ); 
end; 


when  17.. 18  => 
declare 

type  WORKING.FLOAT  is  digits 

( 1 8+SYSTEM . MAX.DIGITS  -  abs(18-SYSTEM.MAX_DIGITS))/2; 

R,  POLY  :  WORKING.FLOAT; 
begin 

R  :=  WORKING.FLOAT (  R1  +  R2  ); 

POLY  :=  R*R*(  5. 00000.00000 .00000.07339E-01  + 

R*(  1. 66666 .66666.66666.69177E-01  + 

R*(  4. 16666.66666.28680.32559E-02  + 

R*(  8. 33333.33332.52083.91 118E-03  + 

R*(  1 . 38889.44766.51246.30293E-03  + 

R*(  1 . 98413.53190.32208.33704E-04  )))))); 

q  :=  R1  +  (  R2  +  LONGEST.FLOAT (  POLY  )  ); 
end; 

when  19.. 27  => 
declare 
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type  WORKING.FLOAT  is  digits 

(27+SYSTEM.MAX_DIGITS  -  abs(27-SYSTEH.MAX_DIGITS))/2; 
R,  POLY  :  WORKING.FLOAT; 
begin 

R  :=  WORKING_FLOAT (  R1  +  R2  ); 

POLY  :=  R*R*(  4.99999_99999_99999_99999_99636_2107BE-0i  + 
R*(  1 . 66666_66666_66666_66666_66512_04136E-01  + 
R*(  4. 16666_66666_66666_69681_E9325_03184E-02  + 
R*(  8 . 33333_33333_33333_40906_33326_46233E-03  + 
R*(  1. 38888 _88888_81124_92492_26093_01620E-03  + 
R*(  1 . 98412_69841_13983_54303_69568_16543E-04  + 
R*(  2.48016_66086_20855_39725_92760_56125E-05  + 
R*(  2.75574_13983_5i388_82843_29291  74995E-06 
)))))»); 

q  :=  Ri  +  (  R2  +  LOHGEST_FLOAT(  POLY  )  ); 
end; 

when  28 . . 33  => 
declare 

type  WORKINGJFLOAT  is  digits 

(33+SYSTEM . MAX_DIGITS  -  abs (33-SYSTEM. MAX_DIGITS))/2; 
R,  POLY  :  W0RKIHG_FL0AT ; 
begin 

R  :=  WORKING .FLOAT (  Rl  +  R2  ); 

POLY  :=  R*R*(  5.0E-01  + 

R*(  1 . 66666_66666_66666_66666_66666_66668_18891E-01  + 

R*(  4. 16666_66666_66666_66666_66666_66671_98062E-02  + 

R*(  8 . 33333_33333_33333_33333_33182_72433_96473E-03  + 

R*(  1 . 38888_88888_88888_88888_88860_77788_96115E-03  + 

R*(  1 . 98412_69841_26984_13216_98830_39302_820E-04  + 

R*(  2.4801B_87301_B8730_16549_32617_44006_810E-05  + 

R*(  2.7B573_19223_90497_BOB21_23337_44713_411E-06  + 

R*(  2 . 75573_19223_90383_09381_24S3i_22474_208E-07  + 

R*(  2.S0521_67036_89710_14700_24557_8863S_3SlE-08  + 

R*(  2.08768_06002_87469_73970_46716  40247  S97E-09 

))))))))))); 

Q  :=  Rl  +  (  R2  +  LOHGEST_FLOAT(  POLY  )  ); 
end; 

when  others  => 

raise  PROGRAM.ERROR;  —  assumption  (1)  is  violated, 
end  case; 


This  completes  the  core  approximation. 
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—  Step  4.  Function  value  reconstruction. 

—  We  now  reconstruct  the  exponential  of  the  input  argument. 

—  The  order  of  the  computation  below  must  be  strictly  observed. 

F  :=  TWO_TO_ J_BY_32 (  J,  LEAD  )  +  TW0_T0_J_BY_32(  J,  TRAIL  ); 

case  LONGEST .FLOAT' MACHIHE_RADIX  is 
when  2  => 

Y  :=  TW0_T0_J_BY_32(  J,  LEAD  )  + 

(  TWO_TO_ J_BY_32 (  J,  TRAIL  )  +  F*q  ); 

RESULT  :=  FLOAT_TYPE(  SCALE (  Y,  M  )  ); 

when  16  => 

Y  :=  SCALE (  TW0_T0_J_BY_32(  J,  LEAD  ),  M  )  + 

SCALE (  TW0_T0_ J _BY_32 (  J,  TRAIL  )  +  F*q,  M  ); 

RESULT  :=  FL0AT_TYPE(  Y  ); 

when  others  => 

raise  PROGRAM.ERROR ;  —  assumption  (1)  is  violated, 

end  case; 

if  RESULT  /=  0.0  then 
return (  RESULT  ) ; 
else 

raise  ARGUMENT .ERROR; 
end  if; 

exception 

when  NUMERI C .ERROR  |  CONSTRAINT.ERROR  => 
raise  ARGUMENT. ERROR; 

—  This  handling  may  be  changed  in  the  future.  For  one 

—  thing,  overflowing  the  constraints  of  the  base  type 

—  of  FLOAT.TYPE  and  overflowing  the  constraints  of 

—  FLOAT.TYPE  are  indistinguishable  in  the  way  exception 

—  is  handled  now. 


end  EXP; 
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